Checking the behavior and limitations of loading JSON-format file data into BigQuery
Hello, this is Mikami.
What I want to do
- Load JSON data into BigQuery
- See whether nested JSON data can be loaded without creating the table first
- Find out whether the same code can load JSON data whether or not the table already exists
- Find out which compression formats and character encodings are supported when the load data is JSON
- Find out what happens when trying to load unsupported JSON data
Preparing the JSON data
I created sample JSON data with an online sample data generator.
The result is the following JSON data, which includes data types such as boolean and float and also has a nested structure.
[ { "_id": "5e9e78da2a1ce10414295d99", "index": 0, "guid": "936bae0e-97d4-4bcc-b5f1-00c22f79817d", "isActive": true, "balance": "$3,845.13", "picture": "http://placehold.it/32x32", "age": 27, "eyeColor": "green", "name": "Judith Dickson", "gender": "female", "company": "ZYTREX", "email": "judithdickson@zytrex.com", "phone": "+1 (842) 457-2021", "address": "860 Emerald Street, Silkworth, Michigan, 9197", "about": "Ex quis fugiat fugiat velit ut incididunt officia veniam cupidatat duis laborum cupidatat. Exercitation magna eiusmod adipisicing sunt esse incididunt. Adipisicing laborum aliqua qui enim ut Lorem sint pariatur cupidatat. Aliquip minim incididunt cupidatat elit. Incididunt occaecat exercitation mollit aliqua quis. Velit sunt ut cupidatat mollit cupidatat eiusmod excepteur laboris.\r\n", "registered": "2020-02-08T01:46:15 -09:00", "latitude": -5.142501, "longitude": -119.229143, "tags": [ "proident", "Lorem", "est", "officia", "non", "ad", "id" ], "friends": [ { "id": 0, "name": "Opal Keller" }, { "id": 1, "name": "Duncan Bradshaw" }, { "id": 2, "name": "Theresa Mooney" } ], "greeting": "Hello, Judith Dickson! You have 4 unread messages.", "favoriteFruit": "apple" }, (省略) { "_id": "5e9e78da72c32bfc6f591542", "index": 6, "guid": "641b44d6-4e44-452e-90ed-44d5458ccff0", "isActive": false, "balance": "$2,672.01", "picture": "http://placehold.it/32x32", "age": 21, "eyeColor": "green", "name": "Burton French", "gender": "male", "company": "TALAE", "email": "burtonfrench@talae.com", "phone": "+1 (868) 502-2898", "address": "187 Greenpoint Avenue, Tilden, New Mexico, 6712", "about": "Enim adipisicing quis mollit eiusmod irure do. Do voluptate adipisicing voluptate velit minim sunt do. Dolor officia ullamco labore qui aliqua culpa ipsum ex consectetur consectetur irure cillum minim.\r\n", "registered": "2017-08-06T05:21:27 -09:00", "latitude": 57.758086, "longitude": -113.347217, "tags": [ "commodo", "ipsum", "officia", "duis", "nulla", "mollit", "ex" ], "friends": [ { "id": 0, "name": "Carolina Fischer" }, { "id": 1, "name": "Leola Brewer" }, { "id": 2, "name": "Anita Michael" } ], "greeting": "Hello, Burton French! You have 5 unread messages.", "favoriteFruit": "apple" } ]
Loading the JSON data into BigQuery
BigQuery has a schema auto-detection feature, so you don't have to create the table before loading: the table schema is inferred from the data itself.
Using the following code with the Python client library, let's load the JSON data placed in GCS with schema auto-detection enabled.
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = 'dataset_2'
dataset_ref = client.dataset(dataset_id)

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = "gs://test-mikami/data_test/generated.json"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("load_json"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("load_json"))
print("Loaded {} rows.".format(destination_table.num_rows))
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json.py
Starting job 21b89751-cb3b-46f2-bda9-2857cf413092
Traceback (most recent call last):
  File "load_json.py", line 18, in <module>
    load_job.result()  # Waits for table load to complete.
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 818, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false
Looks like it failed with a JSON parse error...
Reading the documentation more carefully:
JSON data must be newline delimited. Each JSON object must be on a separate line in the file.
Ah, so it has to be JSONL (LDJSON). (I had skimmed right past that...)
I converted the JSON file to the newline-delimited format shown below.
{"_id":"5e9e78da2a1ce10414295d99","index":0,"guid":"936bae0e-97d4-4bcc-b5f1-00c22f79817d","isActive":true,"balance":"$3,845.13","picture":"http://placehold.it/32x32","age":27,"eyeColor":"green","name":"Judith Dickson","gender":"female","company":"ZYTREX","email":"judithdickson@zytrex.com","phone":"+1 (842) 457-2021","address":"860 Emerald Street, Silkworth, Michigan, 9197","about":"Ex quis fugiat fugiat velit ut incididunt officia veniam cupidatat duis laborum cupidatat. Exercitation magna eiusmod adipisicing sunt esse incididunt. Adipisicing laborum aliqua qui enim ut Lorem sint pariatur cupidatat. Aliquip minim incididunt cupidatat elit. Incididunt occaecat exercitation mollit aliqua quis. Velit sunt ut cupidatat mollit cupidatat eiusmod excepteur laboris.\r\n","registered":"2020-02-08T01:46:15 -09:00","latitude":-5.142501,"longitude":-119.229143,"tags":["proident","Lorem","est","officia","non","ad","id"],"friends":[{"id":0,"name":"Opal Keller"},{"id":1,"name":"Duncan Bradshaw"},{"id":2,"name":"Theresa Mooney"}],"greeting":"Hello, Judith Dickson! You have 4 unread messages.","favoriteFruit":"apple"} (省略) {"_id":"5e9e78da72c32bfc6f591542","index":6,"guid":"641b44d6-4e44-452e-90ed-44d5458ccff0","isActive":false,"balance":"$2,672.01","picture":"http://placehold.it/32x32","age":21,"eyeColor":"green","name":"Burton French","gender":"male","company":"TALAE","email":"burtonfrench@talae.com","phone":"+1 (868) 502-2898","address":"187 Greenpoint Avenue, Tilden, New Mexico, 6712","about":"Enim adipisicing quis mollit eiusmod irure do. Do voluptate adipisicing voluptate velit minim sunt do. Dolor officia ullamco labore qui aliqua culpa ipsum ex consectetur consectetur irure cillum minim.\r\n","registered":"2017-08-06T05:21:27 -09:00","latitude":57.758086,"longitude":-113.347217,"tags":["commodo","ipsum","officia","duis","nulla","mollit","ex"],"friends":[{"id":0,"name":"Carolina Fischer"},{"id":1,"name":"Leola Brewer"},{"id":2,"name":"Anita Michael"}],"greeting":"Hello, Burton French! You have 5 unread messages.","favoriteFruit":"apple"}
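For reference, a conversion like this is easy to script. Here is a minimal sketch in Python (the file names generated.json / generated_l.json match the ones used in this post; this is an illustration, not necessarily how I actually converted the file):

import json

# Read the pretty-printed JSON array and rewrite it as newline-delimited JSON
# (one object per line), which is what BigQuery load jobs expect.
with open("generated.json", encoding="utf-8") as src:
    records = json.load(src)  # the sample data is a top-level array of objects

with open("generated_l.json", "w", encoding="utf-8") as dst:
    for record in records:
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")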
Loading it again:
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json.py
Starting job 089f59e3-7606-4921-bc00-60faa517453e
Job finished.
Loaded 7 rows.
This time it loaded successfully.
Let's check the data.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq query --use_legacy_sql=false 'SELECT * FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json ORDER BY index' Waiting on bqjob_r37352f72f10beb0a_000001719c9ed545_1 ... (0s) Current status: DONE +---------------+-----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+ | favoriteFruit | friends | latitude | registered | address | longitude | phone | greeting | email | tags | company | name | eyeColor | gender | age | picture | balance | isActive | guid | about | index | _id | +---------------+-----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+ | apple | [{"name":"Opal Keller","id":"0"},{"name":"Duncan Bradshaw","id":"1"},{"name":"Theresa Mooney","id":"2"}] | -5.142501 | 2020-02-08 10:46:15 | 860 Emerald Street, Silkworth, Michigan, 9197 | -119.229143 | +1 (842) 457-2021 | Hello, Judith Dickson! You have 4 unread messages. | judithdickson@zytrex.com | ["proident","Lorem","est","officia","non","ad","id"] | ZYTREX | Judith Dickson | green | female | 27 | http://placehold.it/32x32 | 3845.13 | true | 936bae0e-97d4-4bcc-b5f1-00c22f79817d | Ex quis fugiat fugiat velit ut incididunt officia veniam cupidatat duis laborum cupidatat. Exercitation magna eiusmod adipisicing sunt esse incididunt. Adipisicing laborum aliqua qui enim ut Lorem sint pariatur cupidatat. Aliquip minim incididunt cupidatat elit. Incididunt occaecat exercitation mollit aliqua quis. 
Velit sunt ut cupidatat mollit cupidatat eius | 0 | 5e9e78da2a1ce10414295d99 | | | | | | | | | | | | | | | | | | | | | | | | | strawberry | [{"name":"Polly Phelps","id":"0"},{"name":"Tamra Carey","id":"1"},{"name":"Mercado Barron","id":"2"}] | -67.067065 | 2016-07-18 18:47:59 | 689 Cadman Plaza, Derwood, California, 4868 | -79.439951 | +1 (882) 501-2437 | Hello, Charity Mills! You have 9 unread messages. | charitymills@isologia.com | ["enim","ullamco","aute","adipisicing","irure","nostrud","ea"] | ISOLOGIA | Charity Mills | green | female | 34 | http://placehold.it/32x32 | 2541.93 | true | 34adf2b7-fd94-4413-96eb-656c0966251a | Incididunt eu esse ad sit laborum qui sint esse exercitatio | 1 | 5e9e78da668e337086eaff73 | | | | | | | | | | | | | | | | | | | | | | | | | apple | [{"name":"Kim Nelson","id":"0"},{"name":"Booth Mueller","id":"1"},{"name":"Brewer Hooper","id":"2"}] | -64.729569 | 2018-04-27 21:25:15 | 363 Taaffe Place, Defiance, Illinois, 191 | 91.426433 | +1 (800) 508-2319 | Hello, Jannie Morgan! You have 6 unread messages. | janniemorgan@slumberia.com | ["proident","dolor","quis","deserunt","cillum","adipisicing","sint"] | SLUMBERIA | Jannie Morgan | brown | female | 35 | http://placehold.it/32x32 | 3830.75 | true | 04debd69-618d-45c8-b38d-60660f8a086c | Duis velit exercitation irure dolor consequat reprehenderit adipisicing id. Irure do culpa incididunt Lorem in irure irure dolore cupidatat. Proident ipsum ex qui laborum ut. Aute reprehenderit pariatur minim laboris. Est aliquip laboris labore aliquip dolore nostrud qui amet non magna voluptate ad. Lorem ullamco quis sunt mollit. Occaecat duis elit esse siteiusmod aliqua. | 2 | 5e9e78da6d422ce2fd29895c | | | | | | | | | | | | | | | | | | | | | | | | | strawberry | [{"name":"Vaughn Franks","id":"0"},{"name":"Misty Lawrence","id":"1"},{"name":"Ayala Solis","id":"2"}] | -9.94468 | 2016-06-29 11:36:01 | 994 Saratoga Avenue, Evergreen, Alaska, 2454 | 26.303266 | +1 (832) 564-3999 | Hello, Emilia Burch! You have 9 unread messages. | emiliaburch@halap.com | ["aliqua","ut","ipsum","dolore","elit","id","id"] | HALAP | Emilia Burch | brown | female | 25 | http://placehold.it/32x32 | 2255.86 | false | 69d23d6d-b337-4935-a212-bea42087acc0 | Qui ad fugiat commodo in adipisicing nulla in quis ullamco. Elit minim cillum ex ex. Ut occaecat consectetur voluptate magna duis sit veniam consequat occaecat et consequat aliquip. Lorem et consectetur ea volut quis ea velit nulla sunt ea. | 3 | 5e9e78daf9617236cb600016 | | | | | | | | | | | | | | | | | | | | | | | | | strawberry | [{"name":"Lynn Simon","id":"0"},{"name":"Cummings Charles","id":"1"},{"name":"Karen Cherry","id":"2"}] | 35.3641 | 2014-08-27 18:53:16 | 571 Withers Street, Siglerville, Pennsylvania, 6878 | -53.866185 | +1 (870) 460-2091 | Hello, Tracy Cash! You have 5 unread messages. | tracycash@cytrex.com | ["ipsum","laboris","anim","non","minim","tempor","ea"] | CYTREX | Tracy Cash | green | female | 33 | http://placehold.it/32x32 | 2567.27 | false | 5672d7cb-c597-476b-bbaf-4a6709bf506b | Minim voluptate labore labore qui irure cupidatat officia adipisicing nisi eiusmod. Veniam pariatur dolore esse dolore aliquip ut duis irure exercitation consectetur ad voluptate deserunt. Dolore esse esse suntatur quis nisi aliqua qui veniam cillum consectetur. 
| 4 | 5e9e78da017481351e9c4b9b | | | | | | | | | | | | | | | | | | | | | | | | | banana | [{"name":"Karina Davenport","id":"0"},{"name":"Meyers Chaney","id":"1"},{"name":"Tyson Mack","id":"2"}] | -13.703048 | 2017-05-29 18:28:57 | 573 Louise Terrace, Crenshaw, Puerto Rico, 7823 | -62.551788 | +1 (883) 403-3327 | Hello, Shana Marshall! You have 6 unread messages. | shanamarshall@velity.com | ["aute","esse","nostrud","eu","labore","ullamco","aliqua"] | VELITY | Shana Marshall | blue | female | 28 | http://placehold.it/32x32 | 1351.45 | true | 9b209aee-5f84-46e5-b351-3d28ff7a5406 | Nulla minim duis velit aute mollit do aliquip id duis. Ullamco quis minim do elit voluptate ipsum et. Cupidatat non eu pariatur enim ea mollit esse cillum. Irure exercitation nulla irure magna enim nisi excepte | 5 | 5e9e78da567659cf2f112f14 | | | | | | | | | | | | | | | | | | | | | | | | | apple | [{"name":"Carolina Fischer","id":"0"},{"name":"Leola Brewer","id":"1"},{"name":"Anita Michael","id":"2"}] | 57.758086 | 2017-08-06 14:21:27 | 187 Greenpoint Avenue, Tilden, New Mexico, 6712 | -113.347217 | +1 (868) 502-2898 | Hello, Burton French! You have 5 unread messages. | burtonfrench@talae.com | ["commodo","ipsum","officia","duis","nulla","mollit","ex"] | TALAE | Burton French | green | male | 21 | http://placehold.it/32x32 | 2672.01 | false | 641b44d6-4e44-452e-90ed-44d5458ccff0 | Enim adipisicing quis mollit eiusmod irure do. Do voluptate | 6 | 5e9e78da72c32bfc6f591542 | | | | | | | | | | | | | | | | | | | | | | | | +---------------+-----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+
The array and timestamp data were also loaded correctly.
Let's also check the schema of the table that was created automatically during the load.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq show --schema --format=prettyjson cm-da-mikami-yuki-258308:dataset_2.load_json [ { "mode": "NULLABLE", "name": "favoriteFruit", "type": "STRING" }, { "fields": [ { "mode": "NULLABLE", "name": "name", "type": "STRING" }, { "mode": "NULLABLE", "name": "id", "type": "INTEGER" } ], "mode": "REPEATED", "name": "friends", "type": "RECORD" }, { "mode": "NULLABLE", "name": "latitude", "type": "FLOAT" }, { "description": "bq-datetime", "mode": "NULLABLE", "name": "registered", "type": "TIMESTAMP" }, { "mode": "NULLABLE", "name": "address", "type": "STRING" }, { "mode": "NULLABLE", "name": "longitude", "type": "FLOAT" }, { "mode": "NULLABLE", "name": "phone", "type": "STRING" }, { "mode": "NULLABLE", "name": "greeting", "type": "STRING" }, { "mode": "NULLABLE", "name": "email", "type": "STRING" }, { "mode": "REPEATED", "name": "tags", "type": "STRING" }, { "mode": "NULLABLE", "name": "company", "type": "STRING" }, { "mode": "NULLABLE", "name": "name", "type": "STRING" }, { "mode": "NULLABLE", "name": "eyeColor", "type": "STRING" }, { "mode": "NULLABLE", "name": "gender", "type": "STRING" }, { "mode": "NULLABLE", "name": "age", "type": "INTEGER" }, { "mode": "NULLABLE", "name": "picture", "type": "STRING" }, { "mode": "NULLABLE", "name": "balance", "type": "FLOAT" }, { "mode": "NULLABLE", "name": "isActive", "type": "BOOLEAN" }, { "mode": "NULLABLE", "name": "guid", "type": "STRING" }, { "mode": "NULLABLE", "name": "about", "type": "STRING" }, { "mode": "NULLABLE", "name": "index", "type": "INTEGER" }, { "mode": "NULLABLE", "name": "_id", "type": "STRING" } ]
As expected, the data types and nested structure were auto-detected, and a table matching the load data was created.
Appending JSON data to BigQuery
In the previous run the table didn't exist yet, so it was created as part of the load. This time I want to handle the case where the table is created if it doesn't exist, and appended to (or replaced) if it already does.
With the Python client library, you can choose one of the following three modes via write_disposition:
- WRITE_APPEND: append the data
- WRITE_TRUNCATE: replace the existing data
- WRITE_EMPTY: return an error (if the destination table is not empty)
- google.cloud.bigquery.job.LoadJobConfig | Python Client for Google BigQuery
- google.cloud.bigquery.job.WriteDisposition | Python Client for Google BigQuery
First, let's check whether a new table is still created and loaded even when write_disposition is specified.
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = 'dataset_2'
dataset_ref = client.dataset(dataset_id)

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = 'WRITE_APPEND'
uri = "gs://test-mikami/data_test/generated_l.json"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("load_json_mode"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("load_json_mode"))
print("Loaded {} rows.".format(destination_table.num_rows))
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_mode.py
Starting job 8d4cb964-dda5-4d3a-89f0-028e0004a963
Job finished.
Loaded 7 rows.
Even with write_disposition specified, creating a new table via auto-detection worked fine.
Next, let's load the same data again, still in append mode.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_mode.py
Starting job 42ed0842-4715-48ed-8315-c7fd991ade0f
Job finished.
Loaded 14 rows.
The loaded row count increased to 14 rows, so the data was indeed appended.
Next, let's change the mode to WRITE_TRUNCATE and run the load.
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
#job_config.write_disposition = 'WRITE_APPEND'
job_config.write_disposition = 'WRITE_TRUNCATE'
uri = "gs://test-mikami/data_test/generated_l.json"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("load_json_mode"), job_config=job_config
)  # API request
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_mode.py
Starting job 2023a267-c821-4e1a-b1d3-a00197a97f52
Job finished.
Loaded 7 rows.
As specified, the existing data was deleted and the file data was loaded.
Finally, let's check the behavior of WRITE_EMPTY.
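Presumably the only change from the previous script is the write disposition line; a minimal sketch of that diff:

#job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.write_disposition = 'WRITE_EMPTY'  # fail the load if the table already contains data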
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_mode.py
Starting job 5452865a-5321-419e-aaf0-48cf85a1add4
Traceback (most recent call last):
  File "load_json_mode.py", line 19, in <module>
    load_job.result()  # Waits for table load to complete.
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 818, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.Conflict: 409 Already Exists: Table cm-da-mikami-yuki-258308:dataset_2.load_json_mode
As described in the documentation, the load returned an error.
Loading compressed JSON data into BigQuery
Next, I gzip-compressed the JSON file loaded earlier and loaded it the same way.
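load_json_gzip.py itself isn't shown in this post, so here is a minimal sketch of what it presumably looks like: the same load script as before, with only the URI pointing at the gzipped file (the object name generated_l.json.gz is an assumption). No extra LoadJobConfig option appears to be needed, since BigQuery detects and decompresses gzip sources on its own.

from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('dataset_2')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

# Assumed GCS object name for the gzip-compressed JSONL file.
uri = "gs://test-mikami/data_test/generated_l.json.gz"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("load_json_gzip"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("load_json_gzip"))
print("Loaded {} rows.".format(destination_table.num_rows))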
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_gzip.py
Starting job 4055008e-0d58-4c5d-8eab-c7a6fa489083
Job finished.
Loaded 7 rows.
The data was loaded successfully.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq query --use_legacy_sql=false 'SELECT * FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json_gzip ORDER BY index LIMIT 3' Waiting on bqjob_r38def77188596913_000001719cb185b8_1 ... (0s) Current status: DONE +---------------+----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+ | favoriteFruit | friends | latitude | registered | address | longitude | phone | greeting | email | tags | company | name | eyeColor | gender | age | picture | balance | isActive | guid | about | index | _id | +---------------+----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+ | apple | [{"name":"Opal Keller","id":"0"},{"name":"Duncan Bradshaw","id":"1"},{"name":"Theresa Mooney","id":"2"}] | -5.142501 | 2020-02-08 10:46:15 | 860 Emerald Street, Silkworth, Michigan, 9197 | -119.229143 | +1 (842) 457-2021 | Hello, Judith Dickson! You have 4 unread messages. | judithdickson@zytrex.com | ["proident","Lorem","est","officia","non","ad","id"] | ZYTREX | Judith Dickson | green | female | 27 | http://placehold.it/32x32 | 3845.13 | true | 936bae0e-97d4-4bcc-b5f1-00c22f79817d | Ex quis fugiat fugiat velit ut incididunt officia veniam cupidatat duis laborum cupidatat. Exercitation magna eiusmod adipisicing sunt esse incididunt. Adipisicing laborum aliqua qui enim ut Lorem sint pariatur cupidatat. Aliquip minim incididunt cupidatat elit. Incididunt occaecat exercitation mollit aliqua quis. 
Velit sunt ut cupidatat mollit cupidatat eiusmod exc | 0 | 5e9e78da2a1ce10414295d99 | | | | | | | | | | | | | | | | | | | | | | | | | strawberry | [{"name":"Polly Phelps","id":"0"},{"name":"Tamra Carey","id":"1"},{"name":"Mercado Barron","id":"2"}] | -67.067065 | 2016-07-18 18:47:59 | 689 Cadman Plaza, Derwood, California, 4868 | -79.439951 | +1 (882) 501-2437 | Hello, Charity Mills! You have 9 unread messages. | charitymills@isologia.com | ["enim","ullamco","aute","adipisicing","irure","nostrud","ea"] | ISOLOGIA | Charity Mills | green | female | 34 | http://placehold.it/32x32 | 2541.93 | true | 34adf2b7-fd94-4413-96eb-656c0966251a | Incididunt eu esse ad sit laborum qui sint esse exercitation id qu | 1 | 5e9e78da668e337086eaff73 | | | | | | | | | | | | | | | | | | | | | | | | | apple | [{"name":"Kim Nelson","id":"0"},{"name":"Booth Mueller","id":"1"},{"name":"Brewer Hooper","id":"2"}] | -64.729569 | 2018-04-27 21:25:15 | 363 Taaffe Place, Defiance, Illinois, 191 | 91.426433 | +1 (800) 508-2319 | Hello, Jannie Morgan! You have 6 unread messages. | janniemorgan@slumberia.com | ["proident","dolor","quis","deserunt","cillum","adipisicing","sint"] | SLUMBERIA | Jannie Morgan | brown | female | 35 | http://placehold.it/32x32 | 3830.75 | true | 04debd69-618d-45c8-b38d-60660f8a086c | Duis velit exercitation irure dolor consequat reprehenderit adipisicing id. Irure do culpa incididunt Lorem in irure irure dolore cupidatat. Proident ipsum ex qui laborum ut. Aute reprehenderit pariatur minim laboris. Est aliquip laboris labore aliquip dolore nostrud qui amet non magna voluptate ad. Lorem ullamco quis sunt mollit. Occaecat duis elit esse sit elit ip aliqua. | 2 | 5e9e78da6d422ce2fd29895c | | | | | | | | | | | | | | | | | | | | | | | | +---------------+----------------------------------------------------------------------------------------------------------+------------+---------------------+-----------------------------------------------+-------------+-------------------+----------------------------------------------------+----------------------------+----------------------------------------------------------------------+-----------+----------------+----------+--------+-----+---------------------------+---------+----------+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+--------------------------+
Note, however, that according to the documentation, compressed and uncompressed files cannot be loaded in the same load job, and loading compressed files takes longer than loading uncompressed ones.
Next, let's change the compression format to zip, which is not supported, and try loading it.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_gzip.py
Starting job 6b22fd56-4522-48e0-bf5a-a45afc97ad12
Traceback (most recent call last):
  File "load_json_gzip.py", line 18, in <module>
    load_job.result()  # Waits for table load to complete.
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 818, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: Unexpected token; Parser terminated before end of string
The documentation only mentions gzip compression support for JSON data, and sure enough, the zip-compressed file resulted in an error.
The message says it's a JSON parse error, so until you're used to it, it isn't obvious that the compression format is the cause, but the message is slightly different from the one you get when the JSON format itself is invalid.
Loading JSON data with a character encoding other than UTF-8 into BigQuery
According to the documentation, UTF-8 is the only character encoding supported for JSON files, but let's check what happens when we try to load a JSON file in another encoding.
I converted the sample file to UTF-16.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ iconv -f UTF8 -t UTF16 generated_l.json > generated_l_utf16.json
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ file generated_l_utf16.json
generated_l_utf16.json: Little-endian UTF-16 Unicode text, with very long lines
Trying to load it results, unsurprisingly, in a JSON parse error.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_other_encode.py
Starting job 75b5ce65-c180-4ec1-8160-fb8582a9e446
Traceback (most recent call last):
  File "load_json_other_encode.py", line 16, in <module>
    load_job.result()  # Waits for table load to complete.
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 818, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Failed to parse JSON: Unknown token type; Unexpected token; Parser terminated before end of string
The client library has a parameter for specifying the encoding, so let's specify UTF-16 and try the load again. (I did check beforehand that UTF-16 is not among the valid encodings, though...)
- google.cloud.bigquery.job.Encoding | Python Client for Google BigQuery
- google.cloud.bigquery.job.LoadJobConfig | Python Client for Google BigQuery
(省略)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.encoding = 'UTF-16'
uri = "gs://test-mikami/data_test/generated_l_utf16.json"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("load_json_utf16"), job_config=job_config
)  # API request
(省略)
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_other_encode.py
Traceback (most recent call last):
  File "load_json_other_encode.py", line 13, in <module>
    uri, dataset_ref.table("load_json_utf16"), job_config=job_config
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1593, in load_table_from_uri
    load_job._begin(retry=retry, timeout=timeout)
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/job.py", line 640, in _begin
    retry, method="POST", path=path, data=self.to_api_repr(), timeout=timeout
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 556, in _call_api
    return call()
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/home/ec2-user/test_bq/lib/python3.7/site-packages/google/cloud/_http.py", line 423, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/cm-da-mikami-yuki-258308/jobs: Invalid value for jobConfigurationLoad.encoding: UTF-16 is not a valid value
This time I got scolded with a much clearer error message. (Yes, my bad. lol)
Loading JSON data with an unsupported date format into BigQuery
When loading JSON data into BigQuery, date values apparently need to use hyphens ( - ) as separators.
So, if the data uses an unsupported format such as slashes ( / ), does the load fail? Or does BigQuery quietly load the values as STRING?
I rewrote the date values in the JSON file to use slashes and loaded it.
{"_id":"5e9e78da2a1ce10414295d99","index":0,(省略),"registered":"2020/02/08 01:46:15",(省略)} {"_id":"5e9e78da668e337086eaff73","index":1,(省略),"registered":"2016/07/18 09:47:59",(省略)} {"_id":"5e9e78da6d422ce2fd29895c","index":2,(省略),"registered":"2018/04/27 12:25:15",(省略)} {"_id":"5e9e78daf9617236cb600016","index":3,(省略),"registered":"2016/06/29 02:36:01",(省略)} {"_id":"5e9e78da017481351e9c4b9b","index":4,(省略),"registered":"2014/08/27 09:53:16",(省略)} {"_id":"5e9e78da567659cf2f112f14","index":5,(省略),"registered":"2017/05/29 09:28:57",(省略)} {"_id":"5e9e78da72c32bfc6f591542","index":6,(省略),"registered":"2017/08/06 05:21:27",(省略)}
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_timeformat.py
Starting job 6fc2de29-aa42-48ac-8892-d6e380b1c098
Job finished.
Loaded 7 rows.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq query --use_legacy_sql=false 'SELECT registered FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json_timeformat ORDER BY index'
Waiting on bqjob_r2807c5141c9b56c0_000001719d2062eb_1 ... (0s) Current status: DONE
+---------------------+
|     registered      |
+---------------------+
| 2020-02-08 01:46:15 |
| 2016-07-18 09:47:59 |
| 2018-04-27 12:25:15 |
| 2016-06-29 11:36:01 |
| 2014-08-27 18:53:16 |
| 2017-05-29 18:28:57 |
| 2017-08-06 14:21:27 |
+---------------------+
Huh? The values were loaded as TIMESTAMP just fine?
I checked the table schema as well, and the field is indeed TIMESTAMP.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq show --schema --format=prettyjson cm-da-mikami-yuki-258308:dataset_2.load_json_timeformat
[
(省略)
  {
    "description": "bq-datetime",
    "mode": "NULLABLE",
    "name": "registered",
    "type": "TIMESTAMP"
  },
(省略)
]
Then what about formats other than slashes? I changed the date separators in part of the data to periods ( . ) and loaded the file.
{"_id":"5e9e78da2a1ce10414295d99","index":0,(省略),"registered":"2020.02.08 01:46:15",(省略)} {"_id":"5e9e78da668e337086eaff73","index":1,(省略),"registered":"2016/07/18 09:47:59",(省略)} {"_id":"5e9e78da6d422ce2fd29895c","index":2,(省略),"registered":"2018/04/27 12:25:15",(省略)} {"_id":"5e9e78daf9617236cb600016","index":3,(省略),"registered":"2016/06/29 02:36:01",(省略)} {"_id":"5e9e78da017481351e9c4b9b","index":4,(省略),"registered":"2014/08/27 09:53:16",(省略)} {"_id":"5e9e78da567659cf2f112f14","index":5,(省略),"registered":"2017/05/29 09:28:57",(省略)} {"_id":"5e9e78da72c32bfc6f591542","index":6,(省略),"registered":"2017/08/06 05:21:27",(省略)}
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ python load_json_timeformat.py
Starting job 9eed966d-0c3e-4bc9-bed5-c7d15ae046de
Job finished.
Loaded 7 rows.
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq query --use_legacy_sql=false 'SELECT registered FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json_timeformat_3 ORDER BY index'
Waiting on bqjob_r2cb89ab099467a27_000001719d28b1d6_1 ... (0s) Current status: DONE
+---------------------+
|     registered      |
+---------------------+
| 2020.02.08 01:46:15 |
| 2016/07/18 09:47:59 |
| 2018/04/27 12:25:15 |
| 2016/06/29 02:36:01 |
| 2014/08/27 09:53:16 |
| 2017/05/29 09:28:57 |
| 2017/08/06 05:21:27 |
+---------------------+
(test_bq) [ec2-user@ip-10-0-43-239 test_load]$ bq show --schema --format=prettyjson cm-da-mikami-yuki-258308:dataset_2.load_json_timeformat_3
[
(省略)
  {
    "mode": "NULLABLE",
    "name": "registered",
    "type": "STRING"
  },
(省略)
]
The load wasn't rejected with an error, but this time, as you might expect, the column was loaded as STRING.
So an unsupported date format doesn't cause a load error, but if you want the values loaded as TIMESTAMP, you need to check the date format of the load data beforehand.
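As a side note (not something covered above, just a hedged sketch): if slash-separated dates have already been loaded as STRING, one option is to convert them at query time with PARSE_TIMESTAMP, using a format string that matches the actual values.

from google.cloud import bigquery

client = bigquery.Client()

# Convert the slash-separated strings to TIMESTAMP at query time.
# The format string must match the data; rows in other formats are skipped here.
query = """
    SELECT PARSE_TIMESTAMP('%Y/%m/%d %H:%M:%S', registered) AS registered_ts
    FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json_timeformat_3
    WHERE registered LIKE '%/%'
    ORDER BY registered_ts
"""
for row in client.query(query).result():
    print(row.registered_ts)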
Summary (impressions)
Database services that store JSON data in a single STRING column and let you access it with parse functions are common, but BigQuery, which lets you define a nested table schema and access the fields much like ordinary columns, is really convenient for storing JSON data. (BigQuery can also store JSON data in a single column and access it with parse functions.)
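As a hedged illustration of that last point (the table and column names here are hypothetical, not ones created in this post): if the raw JSON text were stored in a single STRING column, BigQuery's JSON functions could pull individual values out at query time.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table storing each raw JSON record in a STRING column named `payload`.
query = """
    SELECT
      JSON_EXTRACT_SCALAR(payload, '$.name')  AS name,
      JSON_EXTRACT_SCALAR(payload, '$.email') AS email
    FROM `cm-da-mikami-yuki-258308`.dataset_2.load_json_raw
"""
for row in client.query(query).result():
    print(row.name, row.email)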
The error messages when loading data with an unsupported compression format or character encoding felt a bit hard to interpret, but since BigQuery auto-detects the necessary columns and data types from the JSON and creates the table for you without any pre-created table, I found it very easy to use.